Purpose

This script takes a deep dive into the Landsat 5 labels for a more rigorous analysis of outliers in the label dataset, and asks whether we can glean anything from the metadata that would let us pre-emptively toss out scenes when we go to apply the classification algorithm.

harmonize_version = "v2024-04-17"
outlier_version = "v2024-04-17"

LS5 <- read_csv(paste0("data/labels/harmonized_LS57_labels_", harmonize_version, ".csv")) %>% 
  filter(mission == "LANDSAT_5")

Check for mis-matched band data between user data and re-pull

Just look at the data to check for consistency (or inconsistency) between the user-pulled data and our re-pull. Here, the user data are in “BX” format and the re-pull is in “SR_BX” format.

(six plots omitted)

There is some mis-match here, so let’s look at those data. In this case, we’ll just use B7/SR_B7 as a reference to filter inconsistent labels.

LS5_inconsistent <- LS5 %>% 
  filter((is.na(SR_B7) | SR_B7 != B7))

LS5_inconsistent %>% 
  group_by(class) %>% 
  summarise(n_labels = n())
## # A tibble: 6 × 2
##   class                  n_labels
##   <chr>                     <int>
## 1 cloud                       213
## 2 darkNearShoreSediment         1
## 3 lightNearShoreSediment        2
## 4 openWater                     9
## 5 other                         1
## 6 shorelineContamination        2

Most of these are cloud labels, where the pixel is saturated and then masked in the re-pull (resulting in an NA). Let’s drop those from this subset and then look further.

LS5_inconsistent <- LS5_inconsistent %>% 
  filter(!(class == "cloud" & is.na(SR_B7))) 

This leaves ~0.8% of the Landsat 5 labels as inconsistent. Let’s do a quick sanity check to make sure that we’ve dropped values that are inconsistent between pulls:

LS5_filtered <- LS5 %>% 
  filter(mission == "LANDSAT_5",
         # keep rows where SR_B7 has data and the values match between the two
         # pulls,
         (!is.na(SR_B7) & SR_B7 == B7) | 
           # or where the user-specified class is cloud and the pixel was saturated,
           # providing no surface reflectance data,
           (class == "cloud" & is.na(SR_B7)),
         # and where every re-pulled band value is less than or equal to 1
         if_all(LS57_ee,
                ~ . <= 1)) 

And plot:

(six plots omitted)

There are still a couple of mis-matched data points in here, seen in B4, so we’ll move those over to the inconsistent set, too.

LS5_inconsistent <-  LS5_filtered %>% 
  filter(B4 != SR_B4) %>% 
  bind_rows(., LS5_inconsistent)
LS5_filtered <- LS5_filtered %>% 
  filter(B4 == SR_B4) 

And now let’s look at the data by class:

(six plots omitted)

We aren’t actually modeling “other” (not sufficient observations to classify) or “shorelineContamination” (we’ll use this later to block areas where there is likely shoreline contamination in the AOI), so let’s drop those categories and look at the data again.

LS5_class_analysis <- LS5_filtered %>% 
  filter(!(class %in% c("other", "shorelineContamination")))
(six plots omitted)

Check for systemic volunteer inconsistencies

Let’s also go back and check to see if there is any pattern to the inconsistent labels.

## # A tibble: 4 × 3
##   vol_init n_tot_labs n_dates
##   <chr>         <int>   <int>
## 1 HAD               8       4
## 2 BGS               7       3
## 3 LRCP              5       2
## 4 AMP               1       1

There seem to be just a few inconsistencies here, and they occur across multiple dates. This could just be a processing difference (if there happened to be an update to a scene since users pulled these data, or if these labels fall on an overlapping portion of two scenes). I’m not concerned about any systemic errors here that might require modified data handling for a specific scene or contributor.

Outlier handling

There are definitely outliers within this dataset, and they may impact the interpretation of any statistical testing we do. Let’s see if we can narrow down where those outliers occur and/or glean anything from the outlier data that may be applicable to the application of the algorithm.

vertical_data <- LS5_class_analysis %>% 
  pivot_longer(LS57_ee,
             names_to = "band_name",
             values_to = "value") %>% 
  rowid_to_column()
vert_out <- vertical_data %>% 
  select(user_label_id, rowid, date, class, band_name, value, vol_init) %>% 
  group_by(class, band_name) %>% 
  identify_outliers(., value) %>% 
  filter(is.extreme)
outliers <- vert_out  %>% 
  left_join(vertical_data) %>%
  select(-rowid)%>% 
  pivot_wider(names_from = band_name,
              values_from = value,
              values_fn = max)

print("Classes represented in outliers:")
## [1] "Classes represented in outliers:"
unique(outliers$class)
## [1] "darkNearShoreSediment"  "lightNearShoreSediment" "offShoreSediment"      
## [4] "openWater"
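For reference, `identify_outliers()` from {rstatix} (used above) flags a value as an “extreme” outlier when it lies more than 3×IQR beyond the quartiles of its group. A minimal base-R sketch of that rule (`is_extreme()` is an illustrative helper, not the package function):

```r
# flag "extreme" outliers the way rstatix::identify_outliers() does:
# values below Q1 - 3*IQR or above Q3 + 3*IQR
is_extreme <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - 3 * iqr | x > q[2] + 3 * iqr
}

vals <- c(0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.9)
is_extreme(vals)
# only 0.9 is flagged
```

In the pipeline above, this rule is applied within each class/band group via `group_by()`.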

Okay, 84 outliers out of 2125 - and they are all from non-cloud groups. Are there any contributor outliers?

LS5_vol <- LS5_class_analysis %>% 
  filter(class != "cloud") %>% 
  group_by(vol_init) %>% 
  summarise(n_tot = n()) %>% 
  arrange(-n_tot)
LS5_out_vol <- outliers %>% 
  group_by(vol_init) %>% 
  summarise(n_out = n()) %>% 
  arrange(-n_out)
full_join(LS5_vol, LS5_out_vol) %>% 
  mutate(percent_outlier = n_out/n_tot*100) %>% 
  arrange(-percent_outlier)
## # A tibble: 7 × 4
##   vol_init n_tot n_out percent_outlier
##   <chr>    <int> <int>           <dbl>
## 1 BGS        592    59            9.97
## 2 AMP         58     5            8.62
## 3 LRCP       115     8            6.96
## 4 KRW         32     2            6.25
## 5 HAD        352    10            2.84
## 6 FYC         63    NA           NA   
## 7 LAE         15    NA           NA

Meh, not great, not terrible.

How many of these outliers are in specific scenes?

LS5_out_date <- outliers %>% 
  group_by(date, vol_init) %>% 
  summarize(n_out = n())
LS5_date <- LS5_class_analysis %>% 
  filter(class != "cloud") %>% 
  group_by(date, vol_init) %>% 
  summarise(n_tot = n())
LS5_out_date <- left_join(LS5_out_date, LS5_date) %>% 
  mutate(percent_outlier = n_out/n_tot*100) %>% 
  arrange(-percent_outlier)
LS5_out_date
## # A tibble: 13 × 5
## # Groups:   date [13]
##    date       vol_init n_out n_tot percent_outlier
##    <date>     <chr>    <int> <int>           <dbl>
##  1 1988-11-23 BGS         28    67           41.8 
##  2 1993-06-30 HAD          9    24           37.5 
##  3 1998-10-02 BGS         15    98           15.3 
##  4 2010-09-01 AMP          5    58            8.62
##  5 1998-08-31 LRCP         8   115            6.96
##  6 2000-08-04 BGS          5    74            6.76
##  7 1984-07-07 KRW          2    32            6.25
##  8 2006-07-04 BGS          3    58            5.17
##  9 2009-04-23 BGS          2    53            3.77
## 10 2004-08-15 BGS          2    55            3.64
## 11 2006-06-02 BGS          3   105            2.86
## 12 2007-05-04 HAD          1    57            1.75
## 13 2010-04-10 BGS          1    79            1.27

There are two scenes here that have very high outlier percentages - perhaps there is something about the atmospheric correction (AC) in these particular scenes? Or the general scene quality?

LS5_out_date %>% 
  filter(percent_outlier > 20) %>% 
  select(date, vol_init) %>% 
  left_join(., LS5)
## # A tibble: 239 × 45
## # Groups:   date [2]
##    date       vol_init mission class user_label_id    B1    B2    B3    B4    B5
##    <date>     <chr>    <chr>   <chr>         <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 1988-11-23 BGS      LANDSA… open…          4080 0.1   0.079 0.062 0.045 0.015
##  2 1988-11-23 BGS      LANDSA… open…          4081 0.092 0.089 0.062 0.045 0.015
##  3 1988-11-23 BGS      LANDSA… open…          4082 0.103 0.089 0.071 0.045 0.015
##  4 1988-11-23 BGS      LANDSA… open…          4083 0.091 0.089 0.054 0.045 0.021
##  5 1988-11-23 BGS      LANDSA… open…          4084 0.105 0.089 0.054 0.054 0.009
##  6 1988-11-23 BGS      LANDSA… open…          4085 0.099 0.096 0.06  0.058 0.016
##  7 1988-11-23 BGS      LANDSA… open…          4086 0.099 0.096 0.061 0.05  0.016
##  8 1988-11-23 BGS      LANDSA… open…          4087 0.104 0.087 0.076 0.05  0.01 
##  9 1988-11-23 BGS      LANDSA… open…          4088 0.107 0.086 0.068 0.049 0.016
## 10 1988-11-23 BGS      LANDSA… open…          4089 0.098 0.076 0.06  0.049 0.016
## # ℹ 229 more rows
## # ℹ 35 more variables: B7 <dbl>, SR_ATMOS_OPACITY <dbl>, SR_B1 <dbl>,
## #   SR_B2 <dbl>, SR_B3 <dbl>, SR_B4 <dbl>, SR_B5 <dbl>, SR_B7 <dbl>,
## #   SR_CLOUD_QA <dbl>, ST_ATRAN <dbl>, ST_B6 <dbl>, ST_CDIST <dbl>,
## #   ST_DRAD <dbl>, ST_EMIS <dbl>, ST_EMSD <dbl>, ST_QA <dbl>, ST_TRAD <dbl>,
## #   ST_URAD <dbl>, QA_PIXEL <dbl>, QA_RADSAT <dbl>, `system:index` <chr>,
## #   `system:time_start` <dttm>, .geo <chr>, CLOUD_COVER <lgl>, …

This isn’t terribly helpful: there is a range of IMAGE_QUALITY, there are ranges of CLOUD_COVER, and otherwise the metadata are basically the same (but too broad to make a distinction for exclusion). While I feel fine dropping these two scenes from the training dataset, we need a systematic way to make sure we don’t apply the algorithm to similar scenes. If there isn’t anything common in the metadata, we would need to mask pixels by the band ranges present in the training set (as those are the ones we have confidence in) or flag pixels that we aren’t confident in after the fact (because they are outside of the range used for the training dataset).
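A minimal sketch of that band-range masking idea, assuming one column per band as in the label data (`band_range_flags()` and the toy values are illustrative, not part of the pipeline):

```r
# derive per-band min/max from the training labels, then flag any new pixel
# whose value falls outside the range the model was trained on
band_range_flags <- function(train, new_data, bands) {
  ranges <- lapply(bands, function(b) range(train[[b]], na.rm = TRUE))
  names(ranges) <- bands
  out_of_range <- sapply(bands, function(b) {
    new_data[[b]] < ranges[[b]][1] | new_data[[b]] > ranges[[b]][2]
  })
  # a pixel is flagged if any band is outside the training range
  apply(out_of_range, 1, any)
}

train  <- data.frame(SR_B1 = c(0.01, 0.05, 0.10), SR_B2 = c(0.02, 0.04, 0.08))
new_px <- data.frame(SR_B1 = c(0.06, 0.50),       SR_B2 = c(0.03, 0.05))
band_range_flags(train, new_px, c("SR_B1", "SR_B2"))
# -> FALSE TRUE (second pixel's SR_B1 is outside the training range)
```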

How many bands are flagged as outliers in each labeled point? If there are outliers amongst the RGB bands (the bands users saw when labeling data), there is probably a systemic problem. If the outliers are in singular bands, especially those that are not in the visible spectrum, we can dismiss the individual observations and probably assert that the scene as a whole is okay to use in training.

vert_out %>%
  group_by(date, class, vol_init, user_label_id) %>% 
  summarise(n_bands_out = n(),
            bands_out = paste(band_name, collapse = "; ")) %>% 
  filter(n_bands_out >= 3) %>% 
  ungroup(class, user_label_id) %>% 
  summarise(n_labels = n()) %>% 
  arrange(-n_labels)
## # A tibble: 6 × 3
## # Groups:   date [6]
##   date       vol_init n_labels
##   <date>     <chr>       <int>
## 1 1988-11-23 BGS            21
## 2 1998-10-02 BGS             3
## 3 1998-08-31 LRCP            2
## 4 2006-07-04 BGS             2
## 5 2006-06-02 BGS             1
## 6 2007-05-04 HAD             1

At the very least, there are issues in the 1988-11-23 scene… but is that user error? Atmospheric correction?

This looks super hazy, so that may be the issue with this particular scene. Unfortunately, there is no way to know this from the metadata: the SR_ATMOS_OPACITY value is low (which is supposed to indicate clear skies), and the cirrus confidence is low (but it’s also low for a majority of the labels in this dataset). The reanalysis for this scene, or the dense dark vegetation estimation used as an aerosol stand-in in LEDAPS, may be the culprit - or it could be something to do with the surrounding bright snow-covered land impacting the Rrs values of pixels. This is probably a reason to avoid classifying LS5 images during winter months.

Do any of these have QA pixel indications of cloud or cloud shadow?

LS5_class_analysis %>% 
  mutate(QA = case_when(str_sub(QA_PIXEL_binary, 1, 2) %in% c(10, 11)  ~ "cirrus",
                   str_sub(QA_PIXEL_binary, 3, 4) %in% c(10, 11)  ~ "snow/ice",
                   str_sub(QA_PIXEL_binary, 5, 6) %in% c(10, 11)  ~ "cloud shadow",
                   str_sub(QA_PIXEL_binary, 7, 8) %in% c(10, 11)  ~ "cloud",
                   TRUE ~ "clear")) %>% 
  group_by(QA) %>% 
  filter(class != "cloud") %>% 
  summarize(n_tot = n())
## # A tibble: 5 × 2
##   QA           n_tot
##   <chr>        <int>
## 1 cirrus          11
## 2 clear           43
## 3 cloud         1103
## 4 cloud shadow     8
## 5 snow/ice        62

Well, that’s also not helpful, considering that most of the label dataset has at least a medium confidence cloud QA flag as the pixel QA.

LS5_class_analysis %>% 
  mutate(QA = case_when(str_sub(QA_PIXEL_binary, 1, 2) == 11 ~ "cirrus",
                   str_sub(QA_PIXEL_binary, 3, 4) == 11 ~ "snow/ice",
                   str_sub(QA_PIXEL_binary, 5, 6)  == 11 ~ "cloud shadow",
                   str_sub(QA_PIXEL_binary, 7, 8)  == 11 ~ "cloud",
                   TRUE ~ "clear")) %>% 
  group_by(QA) %>% 
  filter(class != "cloud") %>% 
  summarize(n_tot = n())
## # A tibble: 5 × 2
##   QA           n_tot
##   <chr>        <int>
## 1 cirrus          11
## 2 clear         1165
## 3 cloud            2
## 4 cloud shadow     4
## 5 snow/ice        45

Okay, if we use only “high confidence”, this seems a little better.
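For context on those two-character `str_sub()` windows: `QA_PIXEL_binary` is presumably the 16-bit Collection 2 QA_PIXEL value written as a most-significant-bit-first string, so characters 1-2 are the cirrus confidence bits (15-14), 3-4 snow/ice (13-12), 5-6 cloud shadow (11-10), and 7-8 cloud (9-8), with “10”/“11” indicating medium/high confidence. A base-R sketch of that conversion (the helper name is mine):

```r
# convert an integer QA_PIXEL value to a 16-character bit string,
# most significant bit first
qa_to_binary <- function(x) {
  bits <- as.integer(intToBits(x))[1:16]   # least significant bit first
  paste(rev(bits), collapse = "")          # reverse to MSB-first
}

# e.g. a value with bit 3 (cloud) set and bits 9-8 (cloud confidence) = 11
qa <- bitwOr(bitwShiftL(1, 3), bitwShiftL(3, 8))
substr(qa_to_binary(qa), 7, 8)
# -> "11", i.e. high-confidence cloud
```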

How many of these outliers have near-pixel clouds (as measured by ST_CDIST)?

There are 10 labels (11.9% of outliers) that aren’t “cloud” in the outlier dataset that have a cloud distance <500m, and 108 labels (6.6%) in the whole dataset that have a cloud distance <500m. Since this is about the same proportion of labels, I don’t think this is terribly helpful.
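That comparison boils down to a proportion at a distance threshold; a toy sketch (the `cdist_m` vectors are hypothetical values already converted to meters - in the raw data, ST_CDIST is stored in scaled units, so a conversion step would come first):

```r
# fraction of labels within a given distance of a cloud
near_cloud_frac <- function(cdist_m, threshold_m = 500) {
  mean(cdist_m < threshold_m, na.rm = TRUE)
}

outlier_cdist <- c(120, 300, 900, 2000, 450)                    # toy values, m
all_cdist     <- c(120, 300, 900, 2000, 450, 5000, 3200, 800)

near_cloud_frac(outlier_cdist)   # proportion of outlier labels near cloud
near_cloud_frac(all_cdist)       # proportion of all labels near cloud
```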

How many of the outliers have high cloud cover, as reported by the scene-level metadata? Note: we don’t have the direct scene cloud cover associated with individual labels, rather a list of the scene-level cloud cover values associated with the AOI.

The outlier dataset contains 0 (0%) where the max cloud cover was > 75% and 0 (0%) where the mean cloud cover was > 60%. The filtered dataset contains 0 (0%) where max was >75% and 0 (0%) where the mean cloud cover was > 60%. This is a bit more uneven of a comparison, but I don’t think this is enough to filter scenes by scene-level cloud cover.

Training dataset implications

For the purposes of training data, I think we can throw out the data from the 1988-11-23 scene, but I don’t see a reason to throw out any other outliers at this time.

LS5_training_labels <- LS5_class_analysis %>% 
  filter(date != "1988-11-23")

Testing for inter-class differences

We do want to have an idea of how different the classes are, in regards to band data. While there are a bunch of interactions that we could get into here, for the sake of this analysis, we are going to analyze the class differences by band.

Kruskal-Wallis assumptions:

  1. Data are non-Normal or have a skewed distribution.
  2. There must be at least two independent groups.
  3. Data have a similar distribution across groups.
  4. Data are independent; the groups shouldn’t have a relationship to one another.
  5. Each group should contain at least 5 observations.

ANOVA assumptions:

  1. Data are distributed normally.
  2. Data are independent.
  3. Variance across groups is similar.

We can’t entirely assert sample independence, and we know that variance and distribution are different for “cloud” labels, but those data are also visibly different from the other classes.

In order to systematically test for differences between classes and be able to interpret the data, we will need to know some things about our data:

  1. Are the data normally distributed (Shapiro-Wilk)?
  2. Are there outliers that may impact interpretation?
  3. If the data are non-normal, perform the Kruskal-Wallis test; otherwise ANOVA.
  4. If the null is rejected (and there is a difference in at least one class), perform a post-hoc test for pairwise comparisons (Dunn test for both).
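A minimal, self-contained sketch of that per-band workflow in base R (the actual analysis presumably uses {rstatix}, e.g. `dunn_test()`, for the post-hoc step; `pairwise.wilcox.test()` stands in here so the sketch runs without extra packages, and the data are toy values):

```r
set.seed(42)
# toy "band value" data for three classes (lognormal, i.e. skewed)
d <- data.frame(
  value = c(rlnorm(30, 0), rlnorm(30, 0.5), rlnorm(30, 1)),
  class = rep(c("openWater", "OSS", "DNSS"), each = 30)
)

# 1. Shapiro-Wilk normality test within each class
shapiro_p <- tapply(d$value, d$class, function(x) shapiro.test(x)$p.value)

# 2-3. these skewed data call for Kruskal-Wallis rather than ANOVA
kw <- kruskal.test(value ~ class, data = d)

# 4. if the null is rejected, run a pairwise post-hoc comparison
if (kw$p.value < 0.05) {
  pairwise.wilcox.test(d$value, d$class, p.adjust.method = "holm")
}
```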

With this workflow, most classes are statistically different - below are the cases where the pairwise comparisons were not deemed statistically significant:

## # A tibble: 9 × 9
##   band  group1          group2    n1    n2 statistic       p  p.adj p.adj.signif
##   <chr> <chr>           <chr>  <int> <int>     <dbl>   <dbl>  <dbl> <chr>       
## 1 SR_B1 darkNearShoreS… offSh…   160   292      1.47 0.141   1      ns          
## 2 SR_B2 darkNearShoreS… offSh…   160   292     -1.10 0.273   1      ns          
## 3 SR_B3 darkNearShoreS… light…   160   278      2.59 0.00953 0.0953 ns          
## 4 SR_B4 darkNearShoreS… light…   160   278      1.45 0.146   1      ns          
## 5 SR_B5 darkNearShoreS… light…   160   278      1.90 0.0579  0.579  ns          
## 6 SR_B5 darkNearShoreS… offSh…   160   292     -2.77 0.00560 0.0560 ns          
## 7 SR_B7 darkNearShoreS… light…   160   278      1.42 0.156   1      ns          
## 8 SR_B7 darkNearShoreS… offSh…   160   292     -2.65 0.00799 0.0799 ns          
## 9 SR_B7 offShoreSedime… openW…   292   430     -1.48 0.138   1      ns

There is some consistency here: “darkNearShoreSediment” is often not different from other sediment types by band. It is entirely possible that band interactions overpower these non-significant differences.

LS5_training_labels %>% 
  mutate(class_short = case_when(class == "darkNearShoreSediment" ~ "DNSS",
                                 class == "lightNearShoreSediment" ~ "LNSS",
                                 class == "offShoreSediment" ~ "OSS",
                                 TRUE ~ class)) %>% 
ggpairs(., columns = LS57_ee, aes(color = class_short)) + 
  scale_color_colorblind() +
  scale_fill_colorblind() +
  theme_few()

There are definitely some varying patterns here; let’s zoom in on the sediment classes.

LS5_training_labels %>% 
  mutate(class_short = case_when(class == "darkNearShoreSediment" ~ "DNSS",
                                 class == "lightNearShoreSediment" ~ "LNSS",
                                 class == "offShoreSediment" ~ "OSS",
                                 TRUE ~ class)) %>% 
  filter(grepl("Sediment", class)) %>% 
ggpairs(., columns = LS57_ee, aes(color = class_short)) + 
  scale_color_colorblind() +
  scale_fill_colorblind() +
  theme_few()

Okay, this makes sense of why there is no significant difference for many of the band-level “darkNearShoreSediment” comparisons - there’s a lot of overlap in ranges. Looking at these scatterplot matrices, I do think there are likely different enough patterns, when considering multiple bands, that ML should be able to pick up on the subtle differences.